Introduction to scRNAseq analysis using scanpy

Workshop

Author
Affiliation
Irene Robles (irenerobles93@gmail.com)

CoSyne Therapeutics

Published

March 10, 2024

Workshop access

Github:

https://github.com/IreneRobles/scrnaseq_workshops/

Web link:

https://tinyurl.com/robles-introduction-to-scanpy

QR code:

Introduction

Single-cell RNA sequencing (scRNA-seq)

  • scRNA-seq is a powerful technology for sequencing individual cells’ transcriptomes.
  • Provides high-resolution insights into cell-to-cell variability.
  • Enables study of cellular diversity within tissues.
  • Aids in identification of rare cell types.

scRNAseq vs bulk RNAseq

  • Bulk RNA-seq provides a combined overview of gene expression from all cells, masking individual differences (similar to the combined flavor of a smoothie).
  • scRNA-seq provides a detailed profile of each cell’s gene expression (like tasting each fruit separately).

Bulk RNA-seq, LyfeFuel from Unsplash

scRNA-seq, VD Photography from Unsplash

Sample representation

  • In bulk data, each sample is represented by a vector, where each value is the expression of one gene.
  • In single cell data, each sample is a matrix, where each row is a gene and each column is a cell.

\[\begin{align} \text{Bulk} &= \begin{bmatrix} gene_{1} \\ gene_{2} \\ gene_{3}\\ \vdots \\ gene_{n} \end{bmatrix} \\ \\ \text{Single-cell} &= \begin{bmatrix} gene_1, cell_1 & gene_1, cell_2 & gene_1, cell_3 & \dots & gene_1, cell_m \\ gene_2, cell_1 & gene_2, cell_2 & gene_2, cell_3 & \dots & gene_2, cell_m \\ gene_3, cell_1 & gene_3, cell_2 & gene_3, cell_3 & \dots & gene_3, cell_m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ gene_n, cell_1 & gene_n, cell_2 & gene_n, cell_3 & \dots & gene_n, cell_m \end{bmatrix} \end{align}\]
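The difference in shape can be sketched with plain NumPy (toy numbers, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells = 5, 4

# One single-cell sample: a counts matrix (genes x cells)
sc_counts = rng.integers(0, 10, size=(n_genes, n_cells))

# One bulk sample: a single counts vector, one value per gene.
# Conceptually, bulk is like summing the single-cell matrix across cells.
bulk_counts = sc_counts.sum(axis=1)

print(sc_counts.shape)    # (5, 4) -> matrix
print(bulk_counts.shape)  # (5,)   -> vector
```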

Advantages and limitations

Feature                    | Bulk data                            | Single-cell data
Cell resolution            | Average of all cells                 | Individual cell resolution
Sample representation      | Vector of gene expression values     | Matrix of gene expression values
Genomic resolution         | Higher, depends on sequencing depth  | Lower, depends on starting material
Cost                       | Lower                                | Higher
Computational requirements | Lower                                | Higher
Data size                  | Smaller                              | Larger
Data interpretation        | Simpler                              | More complex

scRNAseq packages: Scanpy vs Seurat

  • scanpy is a Python package for single-cell analysis (Wolf, Angerer, and Theis 2018)
  • Seurat is an R package for single-cell analysis (Satija et al. 2015)
  • Both are:
    • User-friendly tools for single-cell analysis
    • Open source
    • Well-documented
    • Widely-used
  • Choice depends on:
    • Language preference
    • Team expertise
    • Integration with downstream analysis
    • Speed and memory requirements

Hint: A good bioinformatician is not restricted by language. R can be used from Python with the rpy2 package, and Python can be used within R with reticulate.

Scanpy: the AnnData object

Scanpy’s AnnData is a flexible container designed for single-cell gene expression data and annotations. Key features include:

  • Data Matrix: Holds gene expression values with cells as rows and genes as columns.
  • Annotations: Allows adding detailed annotations for both cells and genes.
  • Multidimensional Annotations: Supports storing multiple matrices for processed and raw data.
  • Unstructured Data: Provides space for additional information like quality metrics or analysis results.

AnnData object, source: scanpy web

Set up

Go to Google Colab

Open a notebook.

Open a notebook from GitHub.

Enter https://github.com/IreneRobles/scrnaseq_workshops and select workshop_notebook.ipynb

Running the notebook set-up commands should install the required packages and download the test data.

Download repo

git clone https://github.com/IreneRobles/scrnaseq_workshops.git
cd scrnaseq_workshops

Create environment

conda create -n myscanpy python=3.10
conda activate myscanpy
pip install -r requirements.txt

Register jupyter kernel

python -m ipykernel install --user --name myscanpy --display-name "Python (myscanpy)"

Download test data

mkdir data
cd data
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE190nnn/GSE190622/suppl/GSE190622%5Fcount%5Fmatrix%5FAnnotated.csv.gz

Import libraries

import scanpy as sc
import scrublet as scr
import numpy as np
import os
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns

Scanpy settings

sc.settings.verbosity = 3   # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')
scanpy==1.9.8 anndata==0.10.5.post1 umap==0.5.5 numpy==1.26.4 scipy==1.12.0 pandas==2.2.1 scikit-learn==1.4.1.post1 statsmodels==0.14.1 igraph==0.11.4 pynndescent==0.5.11

Load test dataset

Single-cell SMART-seq data from mouse macrophages (Robles-Rebollo et al. 2022)

mf = pd.read_csv("data/GSE190622_count_matrix_Annotated.csv.gz", index_col=0)
adata = sc.AnnData(X=mf.T)
adata.obs["genotype"] = adata.obs_names.to_series().apply(lambda x: x.split("_")[0])
adata.obs["timepoint"] = adata.obs_names.to_series().apply(lambda x: x.split("_")[1])
adata.obs["sample"] = adata.obs["genotype"] + "_" + adata.obs["timepoint"]
adata.var_names_make_unique()
adata
AnnData object with n_obs × n_vars = 1362 × 18429
    obs: 'genotype', 'timepoint', 'sample'

Doublet detection

  • A doublet is a single-cell transcriptome that has been generated by two cells
  • They often have a higher read and gene count
  • They can produce confounding results

Doublets, adapted from (Wolock, Lopez, and Klein 2019)
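The idea behind their higher read and gene counts can be sketched with toy Poisson-distributed counts (illustrative numbers, not real data):

```python
import numpy as np

rng = np.random.default_rng(1)
cell_a = rng.poisson(2.0, size=100)  # toy transcriptome of cell A
cell_b = rng.poisson(2.0, size=100)  # toy transcriptome of cell B

# A doublet's "transcriptome" is the mixture of both cells' reads
doublet = cell_a + cell_b

# Doublets therefore tend to show more total counts and more detected genes
print(doublet.sum(), cell_a.sum())
print((doublet > 0).sum(), (cell_a > 0).sum())
```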

How do we fight doublets:

  • Experiment design
  • Computationally

Scrublet

Scrublet finds doublets based on simulations (Wolock, Lopez, and Klein 2019)

Scrublet algorithm overview (Wolock, Lopez, and Klein 2019)

scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
scrub.plot_histogram()
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.41
Detected doublet rate = 0.1%
Estimated detectable doublet fraction = 25.8%
Overall doublet rate:
    Expected   = 6.0%
    Estimated  = 0.3%
Elapsed time: 24.2 seconds
(<Figure size 640x240 with 2 Axes>,
 array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
        <Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
       dtype=object))

Adjust threshold

predicted_doublets = scrub.call_doublets(threshold=0.2)
scrub.plot_histogram()
Detected doublet rate = 0.7%
Estimated detectable doublet fraction = 50.8%
Overall doublet rate:
    Expected   = 6.0%
    Estimated  = 1.4%
(<Figure size 640x240 with 2 Axes>,
 array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
        <Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
       dtype=object))

Visualise doublets

scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
scrub.plot_embedding('UMAP', order_points=True);

Filter out doublets

adata.obs["scrublet_doublet"] = predicted_doublets
adata = adata[~adata.obs["scrublet_doublet"]].copy()

Quality metrics

Highest expressed genes

Look for suspects: MALAT1, mitochondrial genes, ribosomal genes, components of the cytoskeleton, etc.

sc.pl.highest_expr_genes(adata, n_top=20)
normalizing counts per cell
    finished (0:00:00)

Counts, genes and mitochondrial percent

sc.pp.calculate_qc_metrics computes quality control metrics for each cell.

  • Number of counts per cell
  • Number of genes per cell
  • Percentage of counts that come from mitochondrial genes
# Mitochondrial genes
adata.var['mt'] = adata.var_names.str.startswith('mt-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True, rotation = 90)

Different samples might require different thresholds

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True, groupby='sample', rotation = 90)

sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', color='sample')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='sample')

In this case, cells from different samples are reasonably homogeneous. However, that is not always the case.

Example case where different tissues might require different quality control thresholds, fraction of Tabula Muris data (Schaum et al. 2018)

QC filtering

Filter cells based on quality metrics

sc.pp.filter_cells(adata, min_genes=500)
sc.pp.filter_genes(adata, min_cells=5)
adata = adata[adata.obs.n_genes_by_counts < 5000, :]
adata = adata[adata.obs.pct_counts_mt < 15, :]
filtered out 1 genes that are detected in less than 5 cells

Normalisation

For simplicity, we will use the simplest method, library size normalisation and log transformation, but there are others (Hafemeister and Satija 2019).

Steps:

  • Transcript length normalisation:
    • Adjusts for differences in transcript length between genes
    • Divides the counts in each cell by the length of the transcript
    • Not necessary if you sequence a fixed region of the transcript (3’ or 5’ in 10X, our case)
  • Library size normalisation:
    • Adjusts for differences in sequencing depth between cells
    • Divides the counts in each cell by the total counts in that cell and multiplies by a scale factor (e.g. 10,000)
sc.pp.normalize_total(adata, target_sum=1e4)
normalizing counts per cell
    finished (0:00:00)
  • Log transformation:
    • Logarithm of the normalised counts
    • Makes the data more normally distributed
sc.pp.log1p(adata)
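The two steps above can be reproduced with plain NumPy on a toy matrix, to make explicit what sc.pp.normalize_total and sc.pp.log1p compute:

```python
import numpy as np

# Toy counts: 3 cells (rows) x 4 genes (columns)
counts = np.array([[10., 0., 5., 5.],
                   [ 2., 2., 2., 4.],
                   [ 0., 1., 1., 8.]])

# Library size normalisation: scale every cell to 10,000 total counts
# (what sc.pp.normalize_total(adata, target_sum=1e4) does)
normed = counts / counts.sum(axis=1, keepdims=True) * 1e4

# Log transformation: log(1 + x), as in sc.pp.log1p(adata)
logged = np.log1p(normed)

print(normed.sum(axis=1))  # each cell now sums to 10000
```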

Highly variable genes

  • Genes that show the most variation across cells
  • Often the most informative genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)

Dimensionality reduction

Goal

  • Reducing the number of input variables or features in a dataset.

Why

  • Noise Reduction: Focus on the most variance-contributing features, thus enhancing the signal-to-noise ratio.
  • Data Visualization: We can plot in 2 or 3 dimensions
  • Aids clustering and classification
  • Computational efficiency

Prepare data for dimensionality reduction

The .raw attribute freezes the current state of the AnnData object for later use. We can recover it by calling .raw.to_adata().

adata.raw = adata

Filter by highly variable genes

adata = adata[:, adata.var.highly_variable]
adata
View of AnnData object with n_obs × n_vars = 1277 × 3193
    obs: 'genotype', 'timepoint', 'sample', 'scrublet_doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'sample_colors', 'log1p', 'hvg'

Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed.

sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
regressing out ['total_counts', 'pct_counts_mt']
    finished (0:00:02)

Scale each gene to unit variance. Clip values exceeding standard deviation to 10.

sc.pp.scale(adata, max_value=10)
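Gene-wise scaling amounts to a z-score with clipping; a NumPy sketch on a toy matrix (note that sc.pp.scale with max_value clips only the upper end, whereas this sketch clips both for symmetry):

```python
import numpy as np

# Toy expression matrix: cells in rows, genes in columns
x = np.array([[1., 2.,   3.],
              [2., 4.,   6.],
              [3., 6., 100.]])

# Scale each gene to zero mean and unit variance, then clip extreme values
z = (x - x.mean(axis=0)) / x.std(axis=0)
z = np.clip(z, -10, 10)

print(z.mean(axis=0).round(6))  # ~0 for every gene
```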

Dimensionality reduction techniques

  • Principal Component Analysis (PCA): Linear technique, usually the first step in dimensionality reduction for single-cell data; it identifies the components that capture the most variance.
  • t-Distributed Stochastic Neighbor Embedding (t-SNE): Popular non-linear technique that focusses on preserving local structure.
  • Uniform Manifold Approximation and Projection (UMAP): Popular non-linear technique that better preserves global structure.

Principal component analysis (PCA)

PCA is often the first step in dimensionality reduction for single-cell data. It identifies the principal components that capture the most variance in the data, reducing its dimensionality while preserving its structure as much as possible.

sc.tl.pca(adata, svd_solver='arpack')
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:34)
sc.pl.pca(adata, color='sample')

Variance explained by each component:

sc.pl.pca_variance_ratio(adata, log=True)
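A common way to pick the number of PCs from this plot is a cumulative-variance cut-off. A sketch on hypothetical per-PC variance ratios (after sc.tl.pca, scanpy stores the real values in adata.uns['pca']['variance_ratio']); the 80% threshold is an illustrative choice:

```python
import numpy as np

# Hypothetical explained-variance ratios for the first 8 PCs
variance_ratio = np.array([0.20, 0.10, 0.06, 0.04, 0.03, 0.02, 0.02, 0.01])

# Smallest number of PCs explaining >= 80% of the variance they jointly capture
cumulative = np.cumsum(variance_ratio) / variance_ratio.sum()
n_pcs = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_pcs)  # 4
```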

UMAP and t-SNE

First, we need to compute the neighbourhood graph:

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=20)
computing neighbors
    using 'X_pca' with n_pcs = 20
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)

Compute UMAP:

sc.tl.umap(adata)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:01)

Compute t-SNE:

sc.tl.tsne(adata, n_pcs = 20)
computing tSNE
    using 'X_pca' with n_pcs = 20
    using sklearn.manifold.TSNE
    finished: added
    'X_tsne', tSNE coordinates (adata.obsm) (0:00:03)

Plot UMAP:

sc.pl.umap(adata, color='sample')

Plot t-SNE:

sc.pl.tsne(adata, color='sample')

Clustering

Leiden algorithm. The Leiden algorithm starts from a singleton partition (a). The algorithm moves individual nodes from one community to another to find a partition (b), which is then refined (c). An aggregate network (d) is created based on the refined partition, using the non-refined partition to create an initial partition for the aggregate network. For example, the red community in (b) is refined into two subcommunities in (c), which after aggregation become two separate nodes in (d), both belonging to the same community. The algorithm then moves individual nodes in the aggregate network (e). In this case, refinement does not change the partition (f). These steps are repeated until no further improvements can be made. Figure 3 from (Traag, Waltman, and Van Eck 2019)

sc.tl.leiden(adata)
sc.pl.umap(adata, color=['sample','leiden'], use_raw=False)
running Leiden clustering
    finished: found 11 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

Marker genes

Genes differentially expressed in each cluster.

sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:01)

pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
0 1 2 3 4 5 6 7 8 9 10
0 Ccl5 Mapkapk2 Ctsd Sepp1 Ccnd1 Tnfsf9 Gm12155 Hmgb2 Hist1h2ap Ddit3 Tubb5
1 Il1rn Ebi3 Laptm5 Tnfsf9 Lgals3 Cxcl2 Gm13341 Tuba1b Hist3h2a Atf4 Cdca3
2 Isg15 Il1b Gm42418 Nfkbiz Ifi27l2a Tnf Gm42836 Tnf Trim17 Hspa5 Ube2c
3 Mndal Serp1 Lamp1 Cxcl1 Rnh1 Ccl4 Gm29216 Tubb5 H2afx Slc3a2 Birc5
4 Saa3 Ehd1 Sepp1 Serinc3 Lpl Id2 Mak Top2a Hist1h2aa Sqstm1 Plk1
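The wide table above (clusters as columns) can be reshaped into a tidy long table with plain pandas; a sketch on a toy stand-in for adata.uns['rank_genes_groups']['names'] (gene names illustrative). scanpy also provides sc.get.rank_genes_groups_df for extracting results, including scores and p-values:

```python
import pandas as pd

# Toy stand-in: top-ranked genes per cluster, clusters as columns
names = {"0": ["Ccl5", "Il1rn"], "1": ["Mapkapk2", "Ebi3"]}

# Wide -> tidy long table: one row per (rank, cluster) pair
wide = pd.DataFrame(names)
tidy = wide.melt(var_name="cluster", value_name="gene", ignore_index=False)
tidy.index.name = "rank"
print(tidy.reset_index())
```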

Plot some markers:

sc.pl.umap(adata, color=['leiden', 'Ccl5', 'Mapkapk2', 'Ctsd', 'Sepp1', 'Ccnd1','Tnfsf9','Gm12155','Hmgb2','Hist1h2ap','Ddit3','Tubb5'], use_raw=True)

Session info

!pip install session-info
import session_info
session_info.show()
-----
anndata             0.10.5.post1
matplotlib          3.8.3
numpy               1.26.4
pandas              2.2.1
polars              0.20.13
scanpy              1.9.8
scrublet            NA
seaborn             0.13.2
session_info        1.0.0
-----
Modules imported as dependencies
PIL                         10.2.0
annoy                       NA
anyio                       NA
appnope                     0.1.4
arrow                       1.3.0
asttokens                   NA
attr                        23.2.0
attrs                       23.2.0
babel                       2.14.0
certifi                     2024.02.02
cffi                        1.16.0
charset_normalizer          3.3.2
comm                        0.2.1
cycler                      0.12.1
cython_runtime              NA
dateutil                    2.9.0
debugpy                     1.8.1
decorator                   5.1.1
defusedxml                  0.7.1
exceptiongroup              1.2.0
executing                   2.0.1
fastjsonschema              NA
fqdn                        NA
h5py                        3.10.0
idna                        3.6
igraph                      0.11.4
ipykernel                   6.29.3
ipywidgets                  8.1.2
isoduration                 NA
jedi                        0.19.1
jinja2                      3.1.3
joblib                      1.3.2
json5                       NA
jsonpointer                 2.4
jsonschema                  4.21.1
jsonschema_specifications   NA
jupyter_events              0.9.0
jupyter_server              2.12.5
jupyterlab_server           2.25.3
kiwisolver                  1.4.5
lazy_loader                 NA
leidenalg                   0.10.2
llvmlite                    0.42.0
markupsafe                  2.1.5
matplotlib_inline           0.1.6
mpl_toolkits                NA
natsort                     8.4.0
nbformat                    5.9.2
numba                       0.59.0
overrides                   NA
packaging                   23.2
parso                       0.8.3
patsy                       0.5.6
pexpect                     4.9.0
platformdirs                4.2.0
prometheus_client           NA
prompt_toolkit              3.0.43
psutil                      5.9.8
ptyprocess                  0.7.0
pure_eval                   0.2.2
pyarrow                     15.0.0
pycparser                   2.21
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.17.2
pynndescent                 0.5.11
pyparsing                   3.1.1
pythonjsonlogger            NA
pytz                        2024.1
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
scipy                       1.12.0
send2trash                  NA
six                         1.16.0
skimage                     0.22.0
sklearn                     1.4.1.post1
sniffio                     1.3.1
stack_data                  0.6.3
statsmodels                 0.14.1
texttable                   1.7.0
threadpoolctl               3.3.0
tornado                     6.4
tqdm                        4.66.2
traitlets                   5.14.1
typing_extensions           NA
umap                        0.5.5
uri_template                NA
urllib3                     2.2.1
wcwidth                     0.2.13
webcolors                   1.13
websocket                   1.7.0
yaml                        6.0.1
zmq                         25.1.2
zoneinfo                    NA
-----
IPython             8.22.1
jupyter_client      8.6.0
jupyter_core        5.7.1
jupyterlab          4.1.2
notebook            7.1.1
-----
Python 3.10.13 (main, Sep 11 2023, 08:16:02) [Clang 14.0.6 ]
macOS-13.3.1-arm64-arm-64bit
-----
Session information updated at 2024-03-10 14:45

Inspirations

References

Hafemeister, Christoph, and Rahul Satija. 2019. “Normalization and Variance Stabilization of Single-Cell RNA-Seq Data Using Regularized Negative Binomial Regression.” Genome Biology 20 (1): 296.
Risso, Davide, Fanny Perraudeau, Svetlana Gribkova, Sandrine Dudoit, and Jean-Philippe Vert. 2018. “A General and Flexible Method for Signal Extraction from Single-Cell RNA-Seq Data.” Nature Communications 9 (1): 284.
Robles-Rebollo, Irene, Sergi Cuartero, Adria Canellas-Socias, Sarah Wells, Mohammad M Karimi, Elisabetta Mereu, Alexandra G Chivu, et al. 2022. “Cohesin Couples Transcriptional Bursting Probabilities of Inducible Enhancers and Promoters.” Nature Communications 13 (1): 4342.
Satija, Rahul, Jeffrey A Farrell, David Gennert, Alexander F Schier, and Aviv Regev. 2015. “Spatial Reconstruction of Single-Cell Gene Expression Data.” Nature Biotechnology 33 (5): 495–502.
Schaum, Nicholas, Jim Karkanias, Norma F Neff, Andrew P May, Stephen R Quake, Tony Wyss-Coray, Spyros Darmanis, et al. 2018. “Single-Cell Transcriptomics of 20 Mouse Organs Creates a Tabula Muris: The Tabula Muris Consortium.” Nature 562 (7727): 367.
Svensson, Valentine, Eduardo da Veiga Beltrame, and Lior Pachter. 2020. “A Curated Database Reveals Trends in Single-Cell Transcriptomics.” Database 2020: baaa073.
Traag, Vincent A, Ludo Waltman, and Nees Jan Van Eck. 2019. “From Louvain to Leiden: Guaranteeing Well-Connected Communities.” Scientific Reports 9 (1): 5233.
Wolf, F Alexander, Philipp Angerer, and Fabian J Theis. 2018. “SCANPY: Large-Scale Single-Cell Gene Expression Data Analysis.” Genome Biology 19: 1–5.
Wolock, Samuel L, Romain Lopez, and Allon M Klein. 2019. “Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.” Cell Systems 8 (4): 281–91.
